Unsupervised Blocking of Imbalanced Datasets for Record Matching

Authors

  • Chenxiao Dou
  • Daniel Sun
  • Raymond K. Wong
Abstract

Record matching in data engineering refers to searching for data records that originate from the same entities across different data sources. Solutions for record matching usually employ learning algorithms to train a classifier that labels record pairs as either matches or non-matches. In practice, the number of non-matches typically far exceeds the number of matches. This is the so-called imbalance problem, which notoriously increases the difficulty of acquiring a representative dataset for classifier training. Various blocking techniques have been proposed to alleviate this problem, but most of them rely heavily on the effort of human experts. In this paper, we propose an unsupervised blocking method that aims at automatic blocking. To demonstrate its effectiveness, we evaluated our method using real-world datasets. The results show that our method significantly outperforms its competitors.
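The imbalance described in the abstract, and the effect of blocking on it, can be illustrated with a minimal sketch. This is not the unsupervised method proposed in the paper: the toy records, the ground-truth entity labels, and the hand-picked blocking key (first letter of the name) are all assumptions made for illustration, and a manually chosen key is exactly the kind of expert effort the paper aims to avoid.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical toy records: (record_id, entity_id, name).
# entity_id is ground truth used only to count true matches here;
# it would not be available in a real matching task.
records = [
    (0, "a", "john smith"),
    (1, "a", "jon smith"),
    (2, "b", "mary jones"),
    (3, "b", "mary jonas"),
    (4, "c", "peter brown"),
    (5, "d", "paula black"),
]


def count_pairs(pairs):
    """Return (matches, non_matches) for an iterable of record pairs."""
    pairs = list(pairs)
    matches = sum(1 for r, s in pairs if r[1] == s[1])
    return matches, len(pairs) - matches


def block_by_key(recs, key):
    """Group records by a blocking key and yield only within-block pairs."""
    blocks = defaultdict(list)
    for rec in recs:
        blocks[key(rec)].append(rec)
    for block in blocks.values():
        yield from combinations(block, 2)


def first_letter(rec):
    """Assumed, hand-picked blocking key: first letter of the name."""
    return rec[2][0]


# Without blocking, every pair of records is a candidate.
all_matches, all_non_matches = count_pairs(combinations(records, 2))

# With blocking, only records sharing a key are compared.
blk_matches, blk_non_matches = count_pairs(block_by_key(records, first_letter))

print("all pairs:     ", all_matches, "matches,", all_non_matches, "non-matches")
print("after blocking:", blk_matches, "matches,", blk_non_matches, "non-matches")
```

On this toy data, the candidate set shrinks from 15 pairs to 3 while both true matches are kept, so the ratio of non-matches to matches drops sharply. A poorly chosen key would instead discard true matches, which is why the choice of blocking rule matters.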


Similar resources

An unsupervised self-organizing learning with support vector ranking for imbalanced datasets

The aim of a computational learning algorithm is to establish grounds that work for any type of data, once and for all. However, the majority of classifiers are based on balanced datasets. This paper discusses the issues related to the imbalanced data distribution problem and the common strategy to deal with imbalanced datasets. We propose a model capable of handling imbalanced datasets well i...


Using Self-organizing Maps for Binary Classification with Highly Imbalanced Datasets

Highly imbalanced datasets occur in domains like fraud detection, fraud prediction, and clinical diagnosis of rare diseases, among others. These datasets are characterized by the existence of a prevalent class (e.g. legitimate sellers) while the other is relatively rare (e.g. fraudsters). Although small in proportion, the observations belonging to the minority class can be of a crucial importan...


Facial Emotion Ranking Under Imbalanced Conditions

The aim of emotion recognition is to establish grounds that work for different types of emotions. However, the majority of classifiers are based on balanced datasets. Few works attempt to address how to approach facial emotion recognition under imbalanced conditions. This paper discusses the issues related to the imbalanced data distribution problem and the common strategy to...


Entity Matching on Web Tables: a Table Embeddings approach for Blocking

Entity matching, or record linkage, is the task of identifying records that refer to the same entity. Naive entity matching techniques (i.e., brute-force pairwise comparisons) have quadratic complexity. A typical shortcut to the problem is to employ blocking techniques to reduce the number of comparisons, i.e. to partition the data in several blocks and only compare records within the same bloc...


Adaptive Approximate Record Matching

Typographical data entry errors and incomplete documents produce imperfect records in real-world databases. These errors generate distinct records that belong to the same entity. The aim of Approximate Record Matching is to find multiple records that belong to an entity. In this paper, an algorithm for Approximate Record Matching is proposed that can be adapted automatically with input error...




Publication date: 2016